Attorney Docket: 2442/110
Method and Apparatus for Reordering Received Messages 
for Improved Processing Performance 

Technical Field

The present invention relates to methods for processing data messages 
received from a communication network, and more particularly, to a method 
for efficiently handling simultaneous streams of messages from different
sources. 

Background Art

In communication networks, data messages often arrive at a node in an
interleaved stream from multiple sources. The protocol for data message
exchange among nodes can include strict ordering requirements on the
processing of messages exchanged between two nodes. If the receiving node
must wait to acquire resources, such as memory, to process a received
message, these ordering requirements can impose a processing bottleneck.

One such communication network is implemented according to the
Infiniband™ Architecture Specification developed by the Infiniband℠ Trade
Association, the specification for which is incorporated herein by reference
(Infiniband™ Architecture Specification, version 1.0). The Infiniband™
Architecture defines a system area network for connecting multiple
independent processor platforms (i.e., host processor nodes), input/output
("IO") platforms, and IO devices as is shown in Fig. 1. The system 100 is a
communications and management infrastructure supporting both IO and
interprocessor communications for one or more computer systems. The
system 100 can range from a small server with one processor and a few IO
devices to a massively parallel supercomputer installation with hundreds of
processors and thousands of IO devices. Communication among nodes is
accomplished according to an Infiniband™ protocol. In addition, the IP
(Internet protocol) friendly nature of the architecture allows bridging to an
Internet, intranet, or connection to remote computer systems 111.

The Infiniband™ architecture defines a switched communications fabric
101 allowing many devices to concurrently communicate with high
bandwidth and low latency in a protected, remotely managed environment.
The system 100 consists of processor nodes 102, 103, and 104 and IO units
105, 106, 107, and 108 connected through the fabric 101. The fabric is
made up of cascaded switches 109 and routers 110. IO units can range in
complexity from a single attached device, such as a SCSI or LAN adapter, to
large memory-rich RAID subsystems 107.

The foundation of the Infiniband™ operation is the ability of a client
process to queue up a set of instructions that hardware devices or nodes,
such as a channel adapter 112, switch 109, or router 110, execute. This
facility is referred to as a work queue. Work queues are always created in
pairs consisting of a send work queue and a receive work queue. The send
work queue holds instructions that cause data to be transferred between the
client's memory and another process's memory. The receive work queue
holds instructions about where to place data that is received from another
process. Each node may provide a plurality of queue pairs, each of which
provides an independent virtual communication port.
A queue pair can support various types of service, which determine the
types of messages used to communicate on that queue pair. Message types
include request type messages and response type messages. Infiniband™
requires that the request messages on a queue pair be processed in a specific
order. Similar ordering requirements are imposed on response messages.
(See section 9.5 of the Infiniband™ Architecture Specification for ordering
requirements.) Certain request message types (e.g., Send messages) require
information to be fetched identifying resources so that the message can be
processed. For example, a Send message requires a work queue element
("WQE") to be fetched. The WQE specifies the list of virtual addresses where
the data in the Send message is to be stored on the receiving node. In a
large system configuration the latency for the operating system ("OS") to fetch
a WQE can be quite high. If messages are processed serially as received from
an Infiniband™ network, a significant performance penalty can result.

Summary of the Invention 
In accordance with an embodiment of the present invention, a method 
is provided for reordering messages received from a communication network, 
for processing. A message store is provided for received messages. A 
plurality of FIFO queues receive tags corresponding to storage slots in the 
message store. A received message is enqueued for reordering by storing the 
message in a free slot in the message store. A FIFO queue is selected based 
at least on the message's source and type. A tag corresponding to the 
message's storage slot is then loaded onto the selected FIFO queue. When 
messages are ready for further processing, a FIFO queue is selected among 
the queues that have tags at the head of the queue that correspond to
messages that are ready for further processing. The corresponding message 
slot is freed and the tag is removed from the head of the selected FIFO queue. 

In a specific embodiment of the present invention, messages received
from an Infiniband™ network may be reordered such that all request
messages or, alternatively, all response messages received from a single
queue pair are processed in order. This ordering is enforced by using a FIFO
queue to hold tags for all messages awaiting processing that (1) were received
on the same queue pair and (2) are of the same message type. This
embodiment allows processing to proceed contemporaneously among
messages received on different queue pairs and between response and
request messages received on the same queue pair. This embodiment can
advantageously increase message processing efficiency and reduce the
latency time in processing received messages by reducing bottlenecks due to
contention for resources.

A message reordering device that is part of a node is provided in
accordance with an embodiment of the present invention. The device
includes a message store that includes a plurality of storage slots. The
device further includes a plurality of FIFO tag queues. The device includes
logic for enqueuing a received message. Enqueuing a received message
includes storing the message in a storage slot identified by a given tag; selecting
a FIFO tag queue based at least on source identifier and message type for the
message; and loading the given tag onto the selected FIFO tag queue. When a
message is ready for further processing, that is, when the corresponding tag
is at the head of a tag queue and the node has acquired the resources for
processing the message further, logic arbitrates among the ready messages
for selection. Logic frees the storage slot for the selected message and
removes the tag for the message from the head of the corresponding FIFO
queue.

Brief Description of the Drawings 
The foregoing features of the invention will be more readily understood
by reference to the following detailed description, taken with reference to the
accompanying drawings, in which:

Fig. 1 is a block diagram illustrating a system area network in which
an embodiment of the present invention may be employed;

Fig. 2 is a flow chart illustrating an embodiment of the invention;

Fig. 3 is a block diagram of a message management device according
to an embodiment of the present invention; and

Fig. 4 is a flow chart further illustrating an embodiment of the
invention.

Detailed Description of Specific Embodiments

Fig. 2 is a flow chart showing a method of reordering messages received
from a plurality of sending nodes, so that the stream of received messages
may be processed efficiently according to an embodiment of the present
invention. Fig. 3 is a block diagram of a message management device 155
employed in this method; the device is part of a node 150. A node processes or
forwards messages. A node, as used herein, includes an independent
processor platform, an input/output platform, an I/O device or other such
processing nodes. Receipt of a message at the node 150 triggers receive
message processing 230, with the message initially stored in an incoming
message FIFO queue 160. Logic checks 240 whether there is an empty
storage location in a message store 170. The message store has a plurality of
storage locations, or slots 175, for storing messages. Slots may be
implemented in any storage medium, including, without limitation, volatile
memory, nonvolatile memory, disk drives, optical storage, and hardware
registers. If an empty slot is not available in the message store 170, slot
availability is checked again after a delay time 245. When a slot is available,
the message is loaded 250 into that slot. The slot is identified by a tag. A tag
queue is then selected 260 from a plurality of first-in-first-out ("FIFO") tag
queues. If a tag queue already has a tag corresponding to a message of the
same message type that arrived from the same source, the tag is loaded onto
that tag queue 265. Otherwise, the tag is loaded onto an empty FIFO tag
queue 275. Logic at the node then initiates acquisition 280 of the resources
needed to process the message further. For example, if the received
message is a Send message in an Infiniband™ network, logic initiates the
fetch of a WQE to determine where in system memory to store the incoming
data. A resource needed to process a received message may include, for
example, storage buffers that are shared by logic at the node among a
plurality of processes. Once resource acquisition is initiated by logic at the
node, the resources will become available at a later time, according to
whatever method is used by the node to allocate resources. In a specific
embodiment, the logic for resource allocation at the node may include an
operating system. Receive message processing is then complete.
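The receive path of Fig. 2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the class name Reorderer, its fields, and the use of the slot index as the tag are hypothetical choices made for the example. Resource acquisition (step 280) is only marked by a comment, and the sketch assumes that a message is refused, as with a full store, when no suitable tag queue exists, a case the text does not address.

```python
from collections import deque


class Reorderer:
    """Message store plus FIFO tag queues (illustrative names only)."""

    def __init__(self, num_slots, num_queues):
        self.slots = [None] * num_slots                 # message store 170
        self.queues = [deque() for _ in range(num_queues)]
        self.queue_key = [None] * num_queues            # (source, type) per queue

    def _select_queue(self, key):
        # Step 265: reuse a non-empty queue already serving this source and type.
        for i, q in enumerate(self.queues):
            if q and self.queue_key[i] == key:
                return i
        # Step 275: otherwise take any empty queue.
        for i, q in enumerate(self.queues):
            if not q:
                return i
        return None

    def enqueue(self, source, msg_type, payload):
        """Store a message and return its tag, or None if it cannot be stored."""
        if None not in self.slots:
            return None          # steps 240/245: caller retries after a delay
        tag = self.slots.index(None)                    # assumption: tag = slot index
        key = (source, msg_type)
        i = self._select_queue(key)
        if i is None:
            return None          # assumption: no usable queue is treated like a full store
        self.slots[tag] = (source, msg_type, payload)   # step 250
        self.queue_key[i] = key
        self.queues[i].append(tag)                      # steps 260-275
        # Step 280 would start resource acquisition here (e.g., a WQE fetch).
        return tag
```

For example, enqueuing two request messages from source "A" around one request from source "B" places the two "A" tags on the same queue in arrival order and the "B" tag on a separate queue.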

In accordance with a specific embodiment, the communication network
is implemented in accordance with the Infiniband™ protocol. Each tag queue
stores tags for all messages in the message store that were received on the
same queue pair and are of the same type, where the type is either a request
type or a response type.

In accordance with another specific embodiment, the number of FIFO
tag queues is equal to the number of slots in the message store.
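One plausible consequence of this sizing, which the text does not state explicitly, is that a tag queue can always be found for a newly stored message. Each stored message contributes exactly one tag, so when a message occupies a previously free slot, at most N - 1 of the N queues can be non-empty and at least one queue is empty. The Python sketch below checks this by random simulation; the function name, the 0.6 arrival probability, and the key ranges are arbitrary assumptions for illustration.

```python
import random
from collections import deque


def simulate(num_slots, steps, seed=0):
    """Randomly enqueue and dequeue; return True if a tag queue was always found."""
    rng = random.Random(seed)
    free_tags = list(range(num_slots))
    queues = [deque() for _ in range(num_slots)]   # as many queues as slots
    keys = [None] * num_slots                      # (source, type) per queue
    for _ in range(steps):
        if free_tags and rng.random() < 0.6:
            # Arrival: take a free slot, then look for a matching or empty queue.
            key = (rng.randrange(100), rng.choice(("request", "response")))
            tag = free_tags.pop()
            match = [i for i, q in enumerate(queues) if q and keys[i] == key]
            empty = [i for i, q in enumerate(queues) if not q]
            if not match and not empty:
                return False                       # would contradict the argument
            i = match[0] if match else empty[0]
            keys[i] = key
            queues[i].append(tag)
        else:
            # Departure: remove the head tag of some non-empty queue, freeing its slot.
            busy = [q for q in queues if q]
            if busy:
                free_tags.append(rng.choice(busy).popleft())
    return True
```

With fewer queues than slots, the same simulation could fail whenever more distinct source and type combinations are pending than there are queues.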
Fig. 4 shows additional steps according to an embodiment of the
present invention. When logic at the node has acquired the resources needed
for processing a message in the message store and the tag for the message is
at the head of a FIFO tag queue 185, the message is "ready" for dequeuing.
Dequeue message processing begins 400. If multiple messages are ready for
dequeuing 405, a priority arbitration process 415 selects the message to be
dequeued. The selected message is dequeued from the message store for
further processing 430 in a later processing stage. The tag is then removed
from the FIFO tag queue and the message slot is marked "available" for
further received messages 440. If messages are still ready for dequeuing, the
dequeuing process continues 405. If no further messages are ready for
dequeuing, the dequeuing process is complete 450. This method of
dequeuing the messages ensures that messages received from the same
source with the same message type are processed in order, because the tags
for such messages are loaded onto the same FIFO tag queue 185. The
method also ensures that a delay in acquiring resources needed to process a
message from a given source or of a given message type does not delay the
processing of messages from other sources or of a different message type.
This result follows since tags corresponding to such messages will be loaded
onto different FIFO tag queues: messages whose tags are loaded onto different
tag queues can be dequeued and processed in an order different from the
order in which the messages were received at the node.
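The dequeuing loop of Fig. 4 might look like the following Python sketch. The function names, the representation of acquired resources as a set of tags, and the default arbiter (lowest queue index) are assumptions made for illustration; the text leaves the arbitration policy open.

```python
from collections import deque


def ready_queues(queues, resources_ready):
    """Indices of queues whose head tag has had its resources acquired."""
    return [i for i, q in enumerate(queues) if q and q[0] in resources_ready]


def dequeue_all(queues, slots, resources_ready, arbitrate=min):
    """Drain every ready message (steps 405-450); return them in processing order."""
    processed = []
    while True:
        ready = ready_queues(queues, resources_ready)   # step 405
        if not ready:
            return processed                            # step 450
        i = arbitrate(ready)                            # step 415 (placeholder policy)
        tag = queues[i].popleft()                       # remove tag from queue head
        processed.append(slots[tag])                    # step 430: hand message onward
        slots[tag] = None                               # step 440: slot available again
        resources_ready.discard(tag)
```

As a usage example, take slots ["A1", "B1", "A2"] with tags 0 and 2 on one queue and tag 1 on another. If resources are ready only for tags 1 and 2, a call returns ["B1"]: B1 overtakes A1, while A2 stays behind A1 on its queue. Once tag 0 becomes ready, a second call returns ["A1", "A2"], preserving order within the queue.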

In accordance with a specific embodiment, the arbitration process
employs a round-robin algorithm for selecting the next message for
dequeuing, if multiple messages are ready for dequeuing. In the round-robin
approach, the FIFO tag queues are assigned an order. The next ready
message dequeued will be the one at the head of the tag queue next in order
after the tag queue of the last message dequeued. Other arbitration algorithms
can be applied to select the next message to be dequeued, as are known in
the art. Use of any of these arbitration algorithms is within the scope of the
present invention.
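A round-robin arbiter of this kind can be sketched in a few lines of Python. The function name and parameters are hypothetical, and the sketch assumes that the assigned order is the queue index and that the search wraps around from the last queue to the first, which the text implies but does not spell out.

```python
def round_robin_pick(ready, last, num_queues):
    """Return the first ready queue index after `last`, wrapping around.

    ready:      set of queue indices whose head message is ready
    last:       index of the queue served most recently (-1 before any dequeue)
    num_queues: total number of FIFO tag queues
    """
    for step in range(1, num_queues + 1):
        candidate = (last + step) % num_queues
        if candidate in ready:
            return candidate
    return None  # no queue has a ready message
```

Because the search starts just past the last queue served, a queue that always has a ready message cannot starve the others: every other ready queue is visited before it is served again.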

The present invention may be embodied in many different forms,
including, but in no way limited to, computer program logic for use with a
processor (e.g., a microprocessor, microcontroller, digital signal processor, or
general purpose computer), programmable logic for use with a programmable
logic device (e.g., a Field Programmable Gate Array (FPGA) or other PLD),
discrete components, integrated circuitry (e.g., an Application Specific
Integrated Circuit (ASIC)), or any other means including any combination
thereof. In an embodiment of the present invention, predominantly all of the
reordering logic may be implemented as a set of computer program
instructions that is converted into a computer executable form, stored as
such in a computer readable medium, and executed by a microprocessor
within the array under the control of an operating system.

Computer program logic implementing all or part of the functionality
previously described herein may be embodied in various forms, including, but
in no way limited to, a source code form, a computer executable form, and
various intermediate forms (e.g., forms generated by an assembler, compiler,
networker, or locator). Source code may include a series of computer
program instructions implemented in any of various programming languages
(e.g., an object code, an assembly language, or a high-level language such as
Fortran, C, C++, JAVA, or HTML) for use with various operating systems or
operating environments. The source code may define and use various data
structures and communication messages. The source code may be in a
computer executable form (e.g., via an interpreter), or the source code may be
converted (e.g., via a translator, assembler, or compiler) into a computer
executable form.

The computer program may be fixed in any form (e.g., source code
form, computer executable form, or an intermediate form) either permanently
or transitorily in a tangible storage medium, such as a semiconductor
memory device (e.g., a RAM, ROM, PROM, EEPROM, or Flash-Programmable
RAM), a magnetic memory device (e.g., a diskette or fixed disk), an optical
memory device (e.g., a CD-ROM), a PC card (e.g., PCMCIA card), or other
memory device. The computer program may be fixed in any form in a signal
that is transmittable to a computer using any of various communication
technologies, including, but in no way limited to, analog technologies, digital
technologies, optical technologies, wireless technologies, networking
technologies, and internetworking technologies. The computer program may
be distributed in any form as a removable storage medium with
accompanying printed or electronic documentation (e.g., shrink wrapped
software or a magnetic tape), preloaded with a computer system (e.g., on
system ROM or fixed disk), or distributed from a server or electronic bulletin
board over the communication system (e.g., the Internet or World Wide Web).

Hardware logic (including programmable logic for use with a
programmable logic device) implementing all or part of the functionality
previously described herein may be designed using traditional manual
methods, or may be designed, captured, simulated, or documented
electronically using various tools, such as Computer Aided Design (CAD), a
hardware description language (e.g., VHDL or AHDL), or a PLD programming
language (e.g., PALASM, ABEL, or CUPL).

The present invention may be embodied in other specific forms without
departing from the true scope of the invention. The described embodiments
are to be considered in all respects only as illustrative and not restrictive.